Videos: https://omscs.gatech.edu/cs-7641-machine-learning-course-videos
Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.
| Term | Definition |
|---|---|
| Instances | Input. Vectors of attributes that define your input space. |
| Concept | Function. Takes the input space and maps it to an output. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept that maps instances to outputs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. Binary or multiclass, but not considering regression output). |
| Sample | Training set. Set of all of our inputs paired with labels with the correct output. Foundation for inductive learning. |
| Candidate Concept | Concept that you think is the target concept. (e.g. Anyone with curly hair is credit-worthy) |
| Testing Set | Set of instances to test a candidate concept. |
Work through each instance and give an answer based on the decision tree.
Note:
"20 questions" example: We want to narrow the possibilities with each question asked. This motivates us to ask very general questions that significantly narrow down the possibilities.
Notes:
Example: Boolean expressions as decision trees
| Type | Tree Size (nodes) | Description |
|---|---|---|
| n-OR ("any" fn) | Linear in $n$ | Easy to solve |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute to solve the problem (need to look at all attributes). |
Takeaway: We hope that our ML problems are more like n-OR problems rather than n-XOR problems because n-XOR problems are hard and expensive.
* If number of attributes that are true is odd, then the output is true.
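A tiny sketch of why parity is the hard case (the function names `n_or` and `n_xor` are illustrative, not from the lecture): for n-OR a tree can stop as soon as it sees a true attribute, but for n-XOR flipping any single attribute flips the answer, so no attribute lets the tree shortcut the decision.

```python
from itertools import product

def n_or(bits):
    # "any" function: true if any attribute is true
    return any(bits)

def n_xor(bits):
    # odd parity function: true if an odd number of attributes are true
    return sum(bits) % 2 == 1

# For every input, flipping one bit flips the parity answer,
# so every path in an n-XOR tree must examine every attribute.
for bits in product([0, 1], repeat=3):
    flipped = (1 - bits[0],) + bits[1:]
    assert n_xor(bits) != n_xor(flipped)
```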
Q: Exactly how expressive is a decision tree?
Given:
A: The hypothesis space of decision trees is very expressive. We need to cleverly search this space to efficiently find the best representation.
Loop:
1. Pick best attribute A
2. Assign A as decision attribute for Node
3. For each value A, create a descendant of Node
4. Sort training examples to leaves
5. Stop if examples perfectly classified
6. Else iterate over leaves
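The loop above can be sketched as a recursive ID3-style learner. This is a minimal illustration, not the lecture's implementation; `best_attribute` picks by information gain (covered next), and `classify` works through an instance by following the decision attribute at each node.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_attribute(examples, labels, attributes):
    # 1. Pick the attribute with the highest information gain
    def gain(a):
        remainder = 0.0
        for v in set(x[a] for x in examples):
            subset = [y for x, y in zip(examples, labels) if x[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:          # 5. stop if examples perfectly classified
        return labels[0]
    if not attributes:                 # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, labels, attributes)
    tree = {a: {}}                     # 2. assign A as the node's decision attribute
    for v in set(x[a] for x in examples):          # 3. one descendant per value of A
        pairs = [(x, y) for x, y in zip(examples, labels) if x[a] == v]
        xs = [p[0] for p in pairs]                 # 4. sort training examples to leaves
        ys = [p[1] for p in pairs]
        tree[a][v] = id3(xs, ys, [b for b in attributes if b != a])  # 6. iterate
    return tree

def classify(tree, x):
    # Work through an instance and give an answer based on the decision tree
    while isinstance(tree, dict):
        a = next(iter(tree))
        tree = tree[a][x[a]]
    return tree
```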
Definition: Reduction in randomness over the labels in a set of data, gained by knowing the value of a particular attribute.
Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"
Formula:
Formula of entropy: $$ -\sum_v p(v) \log_2 p(v) $$
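A minimal sketch of the two formulas (the function names are mine, and base-2 logs are assumed so entropy is in bits):

```python
from collections import Counter
from math import log2

def entropy(values):
    # H(V) = -sum_v p(v) * log2(p(v))
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def information_gain(labels, splits):
    # Reduction in entropy from splitting the labels by an attribute's values,
    # weighting each split's entropy by its share of the examples
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in splits)
```

A fair coin is maximally random (1 bit of entropy); a split that perfectly separates the labels has an information gain equal to the full label entropy.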
Types of bias:
ID3's inductive bias - Types of preferences:
Q: How to build DTs with continuous attributes?
Q: T/F, does it make sense to repeat an attribute along a path in a tree?
Overfitting: Don't trust the data too much because there is potential for noise. This happens when trees are too big.
Solutions
Regression implies continuous outputs. Information gain doesn't work well with continuous values.
Q: What's the splitting criteria for regression trees?
Q: What's the output?
Etymology: "Regression to the mean"
Finding the best line (measured by least squared error) is achieved through calculus (i.e., setting the derivative of the error to zero).
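A small sketch of that calculus result (the helper name `fit_line` is mine): setting the derivatives of the squared error to zero gives a closed form for the slope and intercept.

```python
def fit_line(xs, ys):
    # Least-squares line y = m*x + b. Setting d/dm and d/db of
    # sum((y - (m*x + b))**2) to zero yields these closed-form solutions.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b
```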
Training data has error that is not part of $f$, so what we actually observe is $f+\epsilon$.
Types of error sources:
Definition: Split the training data into $n$ folds, then hold out one fold as the "test" data and train on the rest. Repeat with each fold, then choose the model with the lowest average error.
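The fold procedure can be sketched as below (a minimal version, assuming `n` divides the data evenly; `train_and_error` is a hypothetical callback that trains on the first pair of lists and returns the error on the held-out pair):

```python
def k_fold_errors(data, labels, k, train_and_error):
    # Hold out each of k folds in turn, train on the remaining folds,
    # and collect the held-out error for each fold.
    n = len(data)
    fold = n // k
    errors = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        test_x, test_y = data[lo:hi], labels[lo:hi]
        train_x = data[:lo] + data[hi:]
        train_y = labels[:lo] + labels[hi:]
        errors.append(train_and_error(train_x, train_y, test_x, test_y))
    return errors
```

Model selection then compares the average of these errors across candidate models.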
Aside - Assumptions of SL:
- Train/test sets are IID (independent and identically distributed)
Applied to housing example in video:
Example of a perceptron (binary classifier with 3 inputs + weights)
Quiz: Example inputs and estimated y
Image: Representation of activation space given 2 inputs + weights. Generates a halfplane.
Takeaway: Perceptrons are linear functions that compute a hyperplane.
Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?
Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?
Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?
Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; even XOR can be expressed by networking multiple units together.
```python
# Perceptron Quiz answer in code
def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    # XOR as a small network: an AND unit feeds the output unit with a
    # large negative weight, cancelling the (1,1) case of the OR-like sum
    activation = x1*w1 + x2*w2 + int(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta):
    assert xor_network(0, 0, w1, w2, w_and, theta) == 0
    assert xor_network(0, 1, w1, w2, w_and, theta) == 1
    assert xor_network(1, 0, w1, w2, w_and, theta) == 1
    assert xor_network(1, 1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

w_1 = 1
w_2 = 1
w_and = -2
threshold = 1
test_xor(w_1, w_2, w_and, threshold)
```
Answer to XOR question:
Definition: Given examples, find weights that map inputs to outputs.
2 ways in which perceptron training is done:
Goal: How to set the weights of a single unit so that it matches a training set.
Components:
(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.
$$ w_i = w_i + \Delta w_i $$

(2) Defining weight change amount - First, calculate $\hat{y}$.
(3) Determining the weight change needed based on the distance between $y$ and $\hat{y}$. This is multiplied by the input value $x_i$ and scaled by the learning rate $\eta$.
$$ \Delta w_i = \eta (y-\hat{y})x_i $$

Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.
Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.
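The update rule above can be sketched as a training loop (a minimal illustration with names of my choosing; the threshold $\theta$ is folded in as a bias weight on a constant input of 1):

```python
def train_perceptron(data, ys, eta=0.1, epochs=100):
    # Perceptron rule: w_i <- w_i + eta * (y - y_hat) * x_i.
    # For linearly separable data this converges in finitely many updates.
    w = [0.0] * (len(data[0]) + 1)         # last weight acts as the bias
    for _ in range(epochs):
        changed = False
        for x, y in zip(data, ys):
            xb = list(x) + [1.0]           # append constant input for the bias
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y_hat != y:                 # only misclassified examples move w
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xb)]
                changed = True
        if not changed:                    # all examples classified: stop early
            break
    return w
```

Trained on the OR truth table, the learned weights define a separating halfplane.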
Q: What if the data is not linearly separable?
Motivation: More robust to linearly non-separable datasets.
Components:
(1) Calculate the error metric. Goal is to minimize this error.
(2) Find the partial derivative of above.
(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.
$$ \frac{\partial E}{\partial w_i} = \sum_{(x,y) \in D} (y-a)(-x_i) $$

| Type | Learning Rule | Pros | Cons |
|---|---|---|---|
| Perceptron Rule | $$\Delta w_i = \eta (y-\hat{y})x_i$$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $$\Delta w_i = \eta (y-a)x_i$$ | More robust to non-linearly separable data | Likely to converge only to a local optimum |
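The gradient descent rule can be sketched as a batch training loop (a minimal illustration, names my own): each epoch sums $(y-a)x_i$ over the dataset, where $a = w \cdot x$ is the unthresholded activation, and steps the weights by $\eta$ times that sum.

```python
def train_delta_rule(data, ys, eta=0.05, epochs=500):
    # Batch gradient descent on E = 1/2 * sum (y - a)^2 with a = w . x.
    # From dE/dw_i = sum (y - a)(-x_i), the descent step is
    # Delta w_i = eta * sum (y - a) * x_i.
    w = [0.0] * (len(data[0]) + 1)          # bias folded in as the last weight
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(data, ys):
            xb = list(x) + [1.0]
            a = sum(wi * xi for wi, xi in zip(w, xb))   # unthresholded output
            for i, xi in enumerate(xb):
                grad[i] += (y - a) * xi
        w = [wi + eta * g for wi, g in zip(w, grad)]
    return w
```

Because it descends on the unthresholded activation, the same loop fits real-valued targets such as points on a line.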
Quiz: Why not do gradient descent on $\hat{y}$?